exSTRa is an R package for the detection of repeat expansions with Illumina next-generation sequencing (NGS) data of cases and controls. exSTRa supports both whole-genome and whole-exome sequencing (WGS and WES). Only paired-end data is supported. This package implements the algorithm as described in:
Rick M. Tankard, Martin B. Delatycki, Paul J. Lockhart, Melanie Bahlo. Detecting known repeat expansions with standard protocol next generation sequencing, towards developing a single screening test for neurological repeat expansion disorders. bioRxiv 157792; doi: https://doi.org/10.1101/157792
A table of repeat expansion disorders for the human genome reference hg19 is included. Other references may be provided in the future, or these may be created by the user. The use of GRCh37, that does not include the ‘chr’ prefix in chromosome names, has not yet been tested, but this is intended for the near future.
This section describes the steps required to process your own data. If you wish to try exSTRa with our example data, you may skip to the next section, “Using the R exSTRa package”.
Data should be aligned to the hg19 reference genome. We recommend this is performed in local alignment mode, whicih allows more bases to be soft-clipped, and hence more tolerant of repeat expansions. We have had success finding expansions in data aligned in a non-local mode.
Optionally, duplicate marking may be performed. We have tested this with Picard tools. The effect of omitting this duplicate marking step has not been assessed.
Both BAM and CRAM formats are supported. The resulting BAM/CRAM file should be sorted and indexed, with each representing exactly one individual.
At present, BAM/CRAM files are summarized by an external Perl package, Bio::STR::exSTRa. Future versions of exSTRa may include this within the R package. For installation instructions, see the package page https://github.com/bahlolab/Bio-STR-exSTRa.
The README.md file of Bio::STR::exSTRa, describes the steps to process the BAM/CRAM files. This will produce a tab-delimited file that is ready for analysis with the R exSTRa package. The R exSTRa package includes an example of this output; assuming exSTRa is installed, this can be found on your system with
system.file("extdata", "HiSeqXTen_WGS_PCR_2.txt", package = "exSTRa")
or online at https://github.com/bahlolab/exSTRa/blob/master/inst/extdata/HiSeqXTen_WGS_PCR_2.txt.
The following describes the analysis of the WGS_PCR_2 data from the exSTRa publication described at the top of this vignette. WGS_PCR_2 is a data set of 16 case and 2 control individuals, sequenced to 60x and 30x depth (16 (14 cases) and 2 (case) samples, respectively) on an Illumina HiSeq X Ten system. A PCR library preparation protocol with Illumina TruSeq Nano was used.
Firstly, if not already, the exSTRa package should be installed:
# install.packages("devtools") # if devtools is not already installed
# install.packages("data.table") # if data.table is not already installed
devtools::install_github("bahlolab/exSTRa")
We load required packages. The data.table package should be loaded first, as exSTRa must redefine it’s copy() function as a generic function.
library(data.table)
library(exSTRa)Data output from Bio::STR::exSTRa is read, along with the specification of repeat expansion loci. This results in an exstra_score object.
str_score <- read_score (
file = system.file("extdata", "HiSeqXTen_WGS_PCR_2.txt", package = "exSTRa"),
database = system.file("extdata", "repeat_expansion_disorders.txt", package = "exSTRa"),
groups.regex = c(control = "^WGSrpt_0[24]$", case = "")
)
# an exstra_score object:
str_score
#> exstra_score object with 33072 observations of type named ($data),
#> for 18 samples. ($samples)
#> Includes associated STR database of 21 loci. ($db)When reading your own data, system.file(...) should be replaced with a string of the path to your data. The groups.regex vector does not currently impact the detection of repeat expansions, but does affect plotting. groups.regex should contain regular expressions to match sample names. The order of groups.regex matters, such that the first match will determine the grouping; this is why the “case” regular expression is an empty string that matches anything, hence specifying any non-control sample as a case.
Note the read_score() function automatically filters low repeat content scores according to the length of the repeat motif; shorter repeat motifs are more likely to appear by chance and hence have a higher threshold than longer repeat motifs, that are less likely to appear in a sequence by chance.
The loci are specified in str_score$db as a data.table.
str_score$db$locus
#> [1] "DM1" "DM2" "DRPLA" "EPM1A" "FRAXA" "FRAXE" "FRDA"
#> [8] "FTDALS1" "HD" "HDL2" "SBMA" "SCA1" "SCA10" "SCA12"
#> [15] "SCA17" "SCA2" "SCA3" "SCA36" "SCA6" "SCA7" "SCA8"We can plot an empirical distribution function (ECDF) of repeat scores for a chosen locus. This example is of theHuntington disease (HD) locus. Here, all case samples, including those for other loci, are colored red.
plot(str_score["HD"])We may also choose a locus directly from the plot() function. Here, we additionally set the color of two known HD cases. All other samples are automatically colored a transparent black.
plot(str_score, "HD", sample_col = c("WGSrpt_10" = "red", "WGSrpt_12" = "blue"))More information on the plot function on exstra_score objects can be accessed with
?`[.exstra_score`
For convenience and speed in this vignette, we only assess expansions for the Huntington disease, spinocerebellar ataxias 2 and 6, and Friedreich’s ataxia loci.
( str_score_four <- str_score[c("HD", "SCA2", "SCA6", "FRDA")] )
#> exstra_score object with 5468 observations of type named ($data),
#> for 18 samples. ($samples)
#> Includes associated STR database of 4 loci. ($db)For creating multiple ECDF curves, with legends, you may use the plot_multi() function. Here, four ECDFs are created both to the R session (plot_type 1) and to a single PDF file example_images/HiSeqXTen_WGS_PCR_2.pdf.
par(mfrow = c(2, 2))
plot_multi(str_score_four, dir = "example_images",
prefix = "HiSeqXTen_WGS_PCR_2", plot_types = c(1, 2), alpha_case = 0.2)
#> Warning in dir.create(dirbase, recursive = TRUE): 'example_images' already
#> existsHere, we calculate an aggregated T statistic over quantiles as described in the exSTRa paper. By default, p-values are calculated with a simulation procedure. With default settings, this may take several minutes to complete. We use the parallel package to speed this up; by default this uses 1 less thread than available on your system. This creates an exstra_tsum object.
( tsum <- tsum_test(str_score_four, parallel = TRUE) )
#> Simulating distribution for HD
#> Simulating distribution for SCA2
#> Simulating distribution for SCA6
#> Simulating distribution for FRDA
#> exstra_tsum object with 72 T sum statistics ($stats),
#> with p-values calculated ($stats),
#> over 4 loci. ($db)
#>
#> T sum statistics summary:
#> exSTRa T := sum of two sample t-tests
#>
#> Alternative hypotheses: subject sample has a larger allele than background samples.
#>
#> alpha Bonferroni unadjusted
#> 0.0001 0 6
#> 0.001 0 2
#> 0.01 6 1
#> 0.05 1 1
#> 1 65 62
#> NA 0 0
#>
#> Number of samples: 18
#> Number of loci: 4
#> Defined p-values: 72
#> NA p-values: 0
#> Function arguments: trim = 0.15, min.quant = 0.5, B = 9999Plotting an exstra_tsum object highlights significant samples, after Bonferroni correction by default.
par(mfrow = c(2, 2))
plot(tsum)You may manually set the colors each sample will use with a vector of colors, with names the corresponding sample name.
plot_cols <- c(RColorBrewer::brewer.pal(8, "Set2"), RColorBrewer::brewer.pal(8, "Dark2"), "orange", "blue")
names(plot_cols) <- str_score_four$samples[, sample]
plot_cols
#> WGSrpt_02 WGSrpt_04 WGSrpt_05 WGSrpt_07 WGSrpt_08 WGSrpt_09 WGSrpt_10
#> "#66C2A5" "#FC8D62" "#8DA0CB" "#E78AC3" "#A6D854" "#FFD92F" "#E5C494"
#> WGSrpt_11 WGSrpt_12 WGSrpt_13 WGSrpt_14 WGSrpt_15 WGSrpt_16 WGSrpt_17
#> "#B3B3B3" "#1B9E77" "#D95F02" "#7570B3" "#E7298A" "#66A61E" "#E6AB02"
#> WGSrpt_18 WGSrpt_19 WGSrpt_20 WGSrpt_21
#> "#A6761D" "#666666" "orange" "blue"
# To demonstrate the colours used:
par(mfrow = c(1, 1))
pie(rep(1, length(plot_cols)), col = plot_cols, labels = names(plot_cols), cex = 0.7)Note that some data sets may not be able to reach significant levels after correction with the default number of simulations (9999). This can be adjusted with the B parameter of tsum_test(), or a less stringent threshold can be used. Bonferroni correction is too severe here, so we can specify Bonferroni correction only on each locus.
par(mfrow = c(2, 2))
plot(tsum, sample_col = plot_cols, correction = "locus")You may obtain a data.table of each sample and locus with the p-value, and if it is significant with the correction method applied. Here, the correction method is Bonferroni per locus.
(ps <- p_values(tsum, correction = "locus"))
#> locus sample tsum p.value signif
#> 1: FRDA WGSrpt_02 0.22438025 0.4322 FALSE
#> 2: FRDA WGSrpt_04 2.37758148 0.0331 FALSE
#> 3: FRDA WGSrpt_05 -1.07940062 0.8214 FALSE
#> 4: FRDA WGSrpt_07 1.57425576 0.1042 FALSE
#> 5: FRDA WGSrpt_08 -0.79878551 0.7488 FALSE
#> 6: FRDA WGSrpt_09 12.15687245 0.0001 TRUE
#> 7: FRDA WGSrpt_10 -0.04995430 0.5219 FALSE
#> 8: FRDA WGSrpt_11 13.60845349 0.0001 TRUE
#> 9: FRDA WGSrpt_12 -0.94879960 0.7894 FALSE
#> 10: FRDA WGSrpt_13 0.57407040 0.3246 FALSE
#> 11: FRDA WGSrpt_14 1.63576022 0.0965 FALSE
#> 12: FRDA WGSrpt_15 -1.03104591 0.8106 FALSE
#> 13: FRDA WGSrpt_16 -1.07499268 0.8204 FALSE
#> 14: FRDA WGSrpt_17 -0.82994546 0.7567 FALSE
#> 15: FRDA WGSrpt_18 0.25955474 0.4202 FALSE
#> 16: FRDA WGSrpt_19 -1.00755997 0.8047 FALSE
#> 17: FRDA WGSrpt_20 -1.24996212 0.8584 FALSE
#> 18: FRDA WGSrpt_21 0.42092705 0.3681 FALSE
#> 19: HD WGSrpt_02 0.78651664 0.2494 FALSE
#> 20: HD WGSrpt_04 -1.17279494 0.8554 FALSE
#> 21: HD WGSrpt_05 0.72372001 0.2675 FALSE
#> 22: HD WGSrpt_07 -1.69374058 0.9369 FALSE
#> 23: HD WGSrpt_08 0.39030004 0.3720 FALSE
#> 24: HD WGSrpt_09 0.42849174 0.3610 FALSE
#> 25: HD WGSrpt_10 3.91108897 0.0010 TRUE
#> 26: HD WGSrpt_11 -0.60989099 0.7198 FALSE
#> 27: HD WGSrpt_12 4.76740655 0.0002 TRUE
#> 28: HD WGSrpt_13 -1.23399806 0.8672 FALSE
#> 29: HD WGSrpt_14 -0.91600406 0.8026 FALSE
#> 30: HD WGSrpt_15 0.95760367 0.2054 FALSE
#> 31: HD WGSrpt_16 0.65945956 0.2862 FALSE
#> 32: HD WGSrpt_17 -2.02905300 0.9654 FALSE
#> 33: HD WGSrpt_18 -0.53305756 0.6915 FALSE
#> 34: HD WGSrpt_19 1.38595973 0.1170 FALSE
#> 35: HD WGSrpt_20 0.40540384 0.3681 FALSE
#> 36: HD WGSrpt_21 -2.55875459 0.9879 FALSE
#> 37: SCA2 WGSrpt_02 -1.05875451 0.8557 FALSE
#> 38: SCA2 WGSrpt_04 -0.76845003 0.7776 FALSE
#> 39: SCA2 WGSrpt_05 -0.30080795 0.6142 FALSE
#> 40: SCA2 WGSrpt_07 -2.84199617 0.9970 FALSE
#> 41: SCA2 WGSrpt_08 0.46825467 0.3177 FALSE
#> 42: SCA2 WGSrpt_09 -2.12973781 0.9809 FALSE
#> 43: SCA2 WGSrpt_10 0.74965893 0.2300 FALSE
#> 44: SCA2 WGSrpt_11 1.02316649 0.1589 FALSE
#> 45: SCA2 WGSrpt_12 0.85434106 0.1996 FALSE
#> 46: SCA2 WGSrpt_13 -0.64930020 0.7403 FALSE
#> 47: SCA2 WGSrpt_14 0.73265499 0.2344 FALSE
#> 48: SCA2 WGSrpt_15 -0.21260246 0.5810 FALSE
#> 49: SCA2 WGSrpt_16 1.18776049 0.1236 FALSE
#> 50: SCA2 WGSrpt_17 -0.89585180 0.8143 FALSE
#> 51: SCA2 WGSrpt_18 8.17346266 0.0001 TRUE
#> 52: SCA2 WGSrpt_19 2.39053989 0.0099 FALSE
#> 53: SCA2 WGSrpt_20 5.76565693 0.0001 TRUE
#> 54: SCA2 WGSrpt_21 -1.52309932 0.9354 FALSE
#> 55: SCA6 WGSrpt_02 -3.79937114 0.9995 FALSE
#> 56: SCA6 WGSrpt_04 -0.07896286 0.5336 FALSE
#> 57: SCA6 WGSrpt_05 8.87692558 0.0001 TRUE
#> 58: SCA6 WGSrpt_07 9.00122726 0.0001 TRUE
#> 59: SCA6 WGSrpt_08 -0.68122300 0.7414 FALSE
#> 60: SCA6 WGSrpt_09 -0.77901515 0.7733 FALSE
#> 61: SCA6 WGSrpt_10 0.89980173 0.2014 FALSE
#> 62: SCA6 WGSrpt_11 -0.77233141 0.7707 FALSE
#> 63: SCA6 WGSrpt_12 0.54601268 0.3082 FALSE
#> 64: SCA6 WGSrpt_13 -0.16113762 0.5645 FALSE
#> 65: SCA6 WGSrpt_14 -0.78958334 0.7761 FALSE
#> 66: SCA6 WGSrpt_15 1.52409423 0.0793 FALSE
#> 67: SCA6 WGSrpt_16 0.55114384 0.3064 FALSE
#> 68: SCA6 WGSrpt_17 0.88474768 0.2053 FALSE
#> 69: SCA6 WGSrpt_18 -0.77629825 0.7720 FALSE
#> 70: SCA6 WGSrpt_19 -2.32054394 0.9837 FALSE
#> 71: SCA6 WGSrpt_20 1.57876816 0.0718 FALSE
#> 72: SCA6 WGSrpt_21 -4.16169382 0.9997 FALSE
#> locus sample tsum p.value signifTo obtain only the significant samples, you can either use data.table subsetting:
ps[identity(signif)]
#> locus sample tsum p.value signif
#> 1: FRDA WGSrpt_09 12.156872 1e-04 TRUE
#> 2: FRDA WGSrpt_11 13.608453 1e-04 TRUE
#> 3: HD WGSrpt_10 3.911089 1e-03 TRUE
#> 4: HD WGSrpt_12 4.767407 2e-04 TRUE
#> 5: SCA2 WGSrpt_18 8.173463 1e-04 TRUE
#> 6: SCA2 WGSrpt_20 5.765657 1e-04 TRUE
#> 7: SCA6 WGSrpt_05 8.876926 1e-04 TRUE
#> 8: SCA6 WGSrpt_07 9.001227 1e-04 TRUEor when retrieving the data.table from p_values():
p_values(tsum, only.signif = TRUE, correction = "locus")
#> locus sample tsum p.value signif
#> 1: FRDA WGSrpt_09 12.156872 1e-04 TRUE
#> 2: FRDA WGSrpt_11 13.608453 1e-04 TRUE
#> 3: HD WGSrpt_10 3.911089 1e-03 TRUE
#> 4: HD WGSrpt_12 4.767407 2e-04 TRUE
#> 5: SCA2 WGSrpt_18 8.173463 1e-04 TRUE
#> 6: SCA2 WGSrpt_20 5.765657 1e-04 TRUE
#> 7: SCA6 WGSrpt_05 8.876926 1e-04 TRUE
#> 8: SCA6 WGSrpt_07 9.001227 1e-04 TRUEWe may subset the exstra_score object by locus or samples. This is performed by x[loci] or x[loci, samples].
For a single locus:
exstra_wgs_pcr_2["HD"]
#> exstra_score object with 1621 observations of type named ($data),
#> for 18 samples. ($samples)
#> Includes associated STR database of 1 locus. ($db)A single sample:
exstra_wgs_pcr_2[, "WGSrpt_10"]
#> exstra_score object with 1971 observations of type named ($data),
#> for 1 samples. ($samples)
#> Includes associated STR database of 21 loci. ($db)The square brackets also allow filtering loci or samples with data.table syntax, on the x$db and x$samples. For example, we can extract all triplet repeats:
exstra_wgs_pcr_2[unit_length == 3]
#> exstra_score object with 24889 observations of type named ($data),
#> for 18 samples. ($samples)
#> Includes associated STR database of 16 loci. ($db)
exstra_wgs_pcr_2[unit_length == 3]$db$locus
#> [1] "DM1" "DRPLA" "FRAXA" "FRAXE" "FRDA" "HD" "HDL2" "SBMA"
#> [9] "SCA1" "SCA12" "SCA17" "SCA2" "SCA3" "SCA6" "SCA7" "SCA8"or extract all coding repeats:
exstra_wgs_pcr_2[gene_region == "coding"]
#> exstra_score object with 15275 observations of type named ($data),
#> for 18 samples. ($samples)
#> Includes associated STR database of 9 loci. ($db)
exstra_wgs_pcr_2[gene_region == "coding"]$db$locus
#> [1] "DRPLA" "HD" "SBMA" "SCA1" "SCA17" "SCA2" "SCA3" "SCA6" "SCA7"Similarly, this may be performed on sample information:
exstra_wgs_pcr_2[, group == "case"]
#> exstra_score object with 28928 observations of type named ($data),
#> for 16 samples. ($samples)
#> Includes associated STR database of 21 loci. ($db)For combining exstra_score objects. This may be useful if the BAM/CRAM files are analysed separately, but then are required to be analysed together in R. In the following dummy example, we split up the str_score variable into two samples, then recombine them.
data_08 <- str_score[, "WGSrpt_08"]
data_13 <- str_score[, "WGSrpt_13"]
( combined_data <- rbind_score_list(list(data_08, data_13)) )
#> exstra_score object with 4082 observations of type named ($data),
#> for 2 samples. ($samples)
#> Includes associated STR database of 21 loci. ($db)